This README file was generated on 2021-01-22 by SIMON AXELROD.


GENERAL INFORMATION

1. Title of Dataset: Conformer models and training datasets

2. Author Information


	A. First author
	
		Name: Simon Axelrod
		Institution: Harvard University
		Address: 12 Oxford Street, Cambridge, MA, USA
		Email: simonaxelrod83@gmail.com

	B. Second author
		
		Name: Rafael Gomez-Bombarelli
		Institution: Massachusetts Institute of Technology
		Address: 77 Massachusetts Avenue, Cambridge, MA, USA
		Email: rafagb@mit.edu


3. Date of data generation: September 2020

4. Information about funding sources that supported the collection of the data: XSEDE COVID-19 HPC Consortium, project CHE200039. Computational resources provided by NASA Advanced Supercomputing (NAS) Division and LBNL National Energy Research Scientific Computing Center (NERSC), MIT Engaging cluster, Harvard Cannon cluster, and MIT Lincoln Lab Supercloud.



SHARING/ACCESS INFORMATION

1. Licenses/restrictions placed on the data: CC0 1.0 Universal.

2. Links to publications that cite or use the data: https://arxiv.org/abs/2012.08452

3. Links to other publicly accessible locations of the data: None

4. Links/relationships to ancillary data sets: None

5. Was data derived from another source? No

6. Recommended citation for this dataset: BibTeX: @article{axelrod2020molecular,
  title={Molecular machine learning with conformer ensembles},
  author={Axelrod, Simon and Gomez-Bombarelli, Rafael},
  journal={arXiv preprint arXiv:2012.08452},
  year={2020}
}

DATA & FILE OVERVIEW

- The data is split into several sub-folders. To make sure you can easily see the structure of the folders and files, please use **tree view** on Dataverse.
	- To do so, navigate to the [dataset page](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/N4VLQL), and scroll down below **Description**, **Subject**, **Keyword**, etc., and past the "Files", "Metadata", "Terms", and "Versions" tabs.
	-  Right below that is **Change View**. Click "Tree", and you will be able to see all the files organized by folder and sub-folder.
	-  Note that files with the suffix ".csv" are automatically shown with the suffix ".tab" on Dataverse. So if we mention a file like "log\_human\_read.csv", which shows the change in model performance during training, then it will show up as "log\_human\_read.tab". You can download the file from Dataverse in either ".tab" or ".csv" format.
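Because the same table can be downloaded as either ".tab" or ".csv", a small reader that picks the delimiter from the file suffix can be convenient. The following is a minimal sketch, not part of the released code. The function name `read_table` is hypothetical. It assumes that ".tab" downloads are tab-delimited, that ".csv" downloads are comma-delimited, and that the first row holds the column names. If a particular file uses a different separator, pass it explicitly.

```python
import csv
from pathlib import Path


def read_table(path, delimiter=None):
    """Read a ".tab" or ".csv" download into a list of row dictionaries.

    Assumption: ".tab" means tab-delimited and anything else means
    comma-delimited, unless `delimiter` is given explicitly.
    """
    path = Path(path)
    if delimiter is None:
        delimiter = "\t" if path.suffix == ".tab" else ","
    with path.open(newline="") as handle:
        # The first row is assumed to contain the column names.
        return list(csv.DictReader(handle, delimiter=delimiter))
```

For example, `read_table("train_full.tab")` and `read_table("train_full.csv")` should return the same rows if the assumptions above hold for that file.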

	
- There are two main folders:


1. `dsets`: contains neural force field (NFF) datasets and training splits used for training different models. It is divided into sub-folders for each of the different datasets:

		
	(a) `cov_1_cl`: Datasets for molecules with CoV 3CL protease data
			
	(b) `cov_2_cl`: Datasets for molecules with CoV-2 3CL protease data
			
	(c) `cov_2_gen`: Datasets for molecules with general CoV-2 inhibition data
			
	(d) `synthetic`: Datasets for training models on synthetic tasks used in the first GEOM paper (e.g. conformational free energy, number of unique conformers) 
	
	- Each of the first three folders contains folders like `XXX_200` (XXX data with 200 conformers), `XXX_edge_idx` (XXX data with pre-computed indices of which edges are neighbors of which others), and/or `XXX_avg` (XXX data with a single effective conformer for each species, defined by using the average neighbor distances of all the conformers). 

	
	- Each also contains the folder `csvs`, which has files with the train/validation/test splits, either just as SMILES files (e.g. `train_smiles.csv`, `test_smiles.csv`), or with both SMILES and properties (e.g. `train_full.csv`, `test_full.csv`).
			
	- Inside sub-folders like `cov_2_cl_200`, `cov_2_gen_avg`, etc., you will see the folders `0, 1, ...` with different sections of the total dataset. This is useful for training in parallel so that different GPUs and nodes can load data from different folders. 
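The numbered folders can be divided among parallel workers in many ways. The sketch below shows one simple round-robin scheme. It is an illustration only and is not necessarily how the training scripts distribute the data. The function names and the `rank` / `world_size` arguments are hypothetical.

```python
from pathlib import Path


def list_shard_folders(dataset_dir):
    """Return the numbered sub-folders (0, 1, ...) in numerical order."""
    folders = [
        p for p in Path(dataset_dir).iterdir()
        if p.is_dir() and p.name.isdigit()
    ]
    # Sort by integer value so that "10" comes after "2".
    return sorted(folders, key=lambda p: int(p.name))


def shards_for_worker(dataset_dir, rank, world_size):
    """Assign shard folders to one worker in round-robin order."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return list_shard_folders(dataset_dir)[rank::world_size]
```

With four workers, the worker with rank 1 could call `shards_for_worker("dsets/cov_2_cl/cov_2_cl_200", rank=1, world_size=4)` to get folders `1`, `5`, `9`, and so on, assuming the dataset was downloaded to that path.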


2. `models`: contains pre-trained models. It is divided into sub-folders for each of the different datasets, plus an extra folder with model statistics: 

	
	(a) `cov_1_cl`: Models trained on molecules with CoV 3CL protease data
	
	
	(b) `cov_2_cl`: Models trained on molecules with CoV-2 3CL protease data
	
	
	(c) `cov_2_gen`: Models trained on molecules with general CoV-2 inhibition data
	
	
	(d) `synthetic`: Models trained on summary conformer information in the first GEOM dataset
	
	 
	(e) `stats`: Folder with some performance statistics of the different models

	- Each of the first four folders contains a folder like `cp3d_ndu`, `schnet_feat`, `schnet_feat_hyp` etc. The base names of the models are self-explanatory. Some suffixes you might see are `single` (trained on a single conformer instead of 200) and `hyp` (for hyperparameter optimization).

	
	- Each 3D model folder contains the following files:
		- `best_model`: the best model according to the validation metric used during training
		- `checkpoints`: a folder with all the models from each epoch in training
		- `log_human_read.csv`: a log file of the training and validation losses and metrics at each epoch
		- `params.json`: the parameters used to create and train the model. If you want to know the details of how a model was trained and what hyperparameters were used, this file will give you that information.
		- `pred_XXX_test.pickle`: the predictions, fingerprints, and possibly learned attention weights for each species in the dataset, using the model with the best validation XXX score
		- `scores_from_metrics.json`: a summary of the scores on various metrics for each of the different models


	- All the different ChemProp models, such as those that are trained to optimize different metrics or use additional non-learnable features, are in the folder `chemprop`
		- Each ChemProp folder has the normal ChemProp outputs, plus `test_metrics.json`, which reports the different scores of each ChemProp model on the test set using different metrics. There are also output files called `test_pred_XXX.csv`, which contain the predictions for each molecule from the model of fold `XXX`.
		- The sub-folders have a mix of different names, but they all describe how the individual model was trained. For example, `cov_2_gen/chemprop/cp_auc` contains a ChemProp model trained to maximize the validation AUC, while `cov_2_gen/chemprop/from_whim_crest_weights_prc` contains a ChemProp model trained with WHIM features (mean and standard deviation for each species, calculated with weights from CREST calculations) to maximize the validation PRC. 
		- If it isn't clear how exactly a model was trained, you can always navigate to `args.json` or `verbose.log`, which contain all the details of the calculation.
	- For `cov_2_cl` there is also a folder called `transfer`: 
		- The folder `transfer/features` contains all the features generated by CoV 3CL models for molecules in the CoV-2 3CL dataset. For example, `transfer/features/schnet_feat` contains features generated by SchNetFeat trained on CoV 3CL, `transfer/features/from_cp` is the same for ChemProp, and `transfer/features/from_cp_whim` is the same for ChemProp trained with WHIM features (i.e., it contains the concatenation of WHIM features with learned ChemProp features). 
			- The folders with features from 3D models contain the features in two forms. The first are pickle files called `pred_XXX_test.pickle`, where `XXX` is the validation metric used to pick the best model during the original training. These files were generated by running `scripts/cp3d/transfer/get_fps/make_fps.sh`. The second are `npz` files in the form `test_XXX.npz`, which were converted to ChemProp-readable form by `scripts/cp3d/transfer/export_to_cp/save_feats.sh`. These are used by ChemProp during transfer learning.
			- **Note**: The folder `transfer/features/with_whim` contains files with *just the WHIM features alone*. These features are either calculated with equal weights for each conformer or with weights from CREST calculations (`_XXX_crest_YYY`). The latter were used to train all non-transfer-learned ChemProp + WHIM models, e.g. `models/cov_2_gen/chemprop/from_whim_crest_weights_auc`.
		- The folder `transfer/models` has all the models trained with these features. The base names of the model folders are self-explanatory, but there are a few additional suffixes too, like:
			-  `mpnn` / `no_mpnn` / `just_mpnn`: the first two are for whether ChemProp message-passing was used in addition to the fixed features, and the last one means only the MPNN and not the fixed features were used
			-   `from_auc` / `from_prc-auc` /`from_binary_cross_entropy`: which CoV 3CL validation metric was used for choosing the CoV 3CL model to use for fingerprinting
			-   `hyp`: for hyperparameter optimization.
			-   Folders containing `from_cp_whim`: transfer learned from ChemProp + WHIM models.
			-   Folders starting with `with_whim`: not actually transfer learned (they use the WHIM features alone and not the pre-trained ChemProp features), and should be ignored. 
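The naming conventions above can be turned into a small helper for sorting through the folders in `transfer/models`. The sketch below is an illustration only. The function name and the returned keys are hypothetical, and it assumes that the suffixes appear in the folder name exactly as written in the list above.

```python
def describe_transfer_model(folder_name):
    """Summarize the naming conventions found in a transfer model folder name."""
    # Check the longer names first, since "mpnn" is a substring of both
    # "no_mpnn" and "just_mpnn".
    if "just_mpnn" in folder_name:
        mpnn_mode = "just_mpnn"
    elif "no_mpnn" in folder_name:
        mpnn_mode = "no_mpnn"
    elif "mpnn" in folder_name:
        mpnn_mode = "mpnn"
    else:
        mpnn_mode = None

    # Which CoV 3CL validation metric selected the fingerprinting model.
    source_metric = None
    for metric in ("binary_cross_entropy", "prc-auc", "auc"):
        if "from_" + metric in folder_name:
            source_metric = metric
            break

    return {
        "mpnn_mode": mpnn_mode,
        "source_metric": source_metric,
        # Match "hyp" only as a whole underscore-separated token.
        "hyperparameter_search": "hyp" in folder_name.split("_"),
        "from_cp_whim": "from_cp_whim" in folder_name,
        # Folders starting with "with_whim" are not transfer learned.
        "ignore": folder_name.startswith("with_whim"),
    }
```

Folders for which `describe_transfer_model(name)["ignore"]` is true can then be skipped when comparing transfer-learned models.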




	- The scikit-learn folders (e.g. RF) contain the following files:
		- `best_params.json`: the optimized hyperparameters for this model
		- `config.json`: the config file used to create and train the model
		- `hyper_scores.json`: the scores on the validation set of each hyperparameter combination
		-  `pred.json`: predictions of the model on each species
		-  `score.json`: the scores of the model according to different metrics
		-  Some folders also contain files with `_pair_` inserted in the name, which means that atom-pair fingerprints were used.
	
	- The folder `models/cov_2_cl/schnet_feat/seed_0` also contains a sub-folder called `fp_analysis`. This sub-folder has two JSON files with information about fingerprint similarity for a sample of species. The similarity is computed among pairs of hit species, pairs of misses, and hit-miss pairs. The fingerprints are E3FP fingerprints of a single conformer for each species. The conformer of each species is selected either randomly or by attention.
	- Some sub-folders are zipped. These include `folds.zip` (which contains the ChemProp models trained in different folds), `checkpoints.zip` (the checkpoints of 3D models stored at each epoch), and a variety of folders for ChemProp hyperparameter optimization. If you want to see the contents of the folders, you can unzip them using the `unzip` command on the command line.
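If the `unzip` command is not available, or if many archives need to be extracted at once, the Python standard library can do the same job. The following is a minimal sketch with a hypothetical helper name, `extract_archives`. It assumes each archive should be extracted into a folder next to it that is named after the archive, so an archive that already contains a top-level folder will end up one level deeper.

```python
import zipfile
from pathlib import Path


def extract_archives(root, names=("folds.zip", "checkpoints.zip")):
    """Extract matching zip archives found anywhere below `root`.

    Pass names=None to extract every ".zip" file. Each archive is
    extracted into a sibling folder named after it, e.g. "folds.zip"
    goes to "folds/". Returns the list of folders that were created.
    """
    extracted = []
    for archive in sorted(Path(root).rglob("*.zip")):
        if names is not None and archive.name not in names:
            continue
        target = archive.with_suffix("")  # drop the ".zip" suffix
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(target)
    return extracted
```

For example, `extract_archives("models/cov_2_gen")` would unpack every `folds.zip` and `checkpoints.zip` below that folder, assuming the models were downloaded there.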


3. Source code
	- Our code for generating and training the models can be found in the [Neural Force Field Repository](https://github.com/learningmatter-mit/NeuralForceField). 
	- The README file located at `scripts/cp3d/README.md` in the repository explains how to use the scripts to generate and train models. 
	- The notebook tutorial `tutorials/06_cp3d.ipynb` goes into some detail about what happens behind the scenes when making and training the models.